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Abstract. We outline how modern likelihood theory, which provides 
essentially exact inferences in a variety of parametric statistical prob- 
lems, may routinely be applied in practice. Although the likelihood 
procedures are based on analytical asymptotic approximations, the fo- 
cus of this paper is not on theory but on implementation and applica- 
tions. Numerical illustrations are given for logistic regression, nonlinear 
models, and linear non-normal models, and we describe a sampling ap- 
proach for the third of these classes. In the case of logistic regression, 
we argue that approximations are often more appropriate than 'exact' 
procedures, even when these exist. 

Key words and phrases: Conditional inference, heteroscedasticity, lo- 
gistic regression, Lugannani-Rice formula, Markov chain Monte Carlo, 
nonlinear model, R, regression-scale model, saddlepoint approximation, 
spline, statistical computing. 



1. INTRODUCTION 

Monte Carlo inference has developed remarkably 
over the last 30 years. Bootstrap procedures (Efron 
(1979)) are used for a wide range of problems 
(Efron and Tibshirani (1993), Davison and Hinkley 
(1997)). Markov chain Monte Carlo simulation has 
transformed Bayesian modelling (Robert and Casella 
(2004)). The combination of iterative simulation with 
importance sampling and improved algorithms for 
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full enumeration of discrete sample spaces has had 
a strong impact on the analysis of contingency ta- 
bles (Forster, McDonald and Smith (1996), Smith, 
Forster and McDonald (1996), Diaconis and Sturmfels 
(1998), Mehta, Patel and Senchaudhuri (2000)). 
More recently there has been a rise in Bayesian non- 
parametric modelling (Denison et al. (2002)), which 
parallels the use of the bootstrap for nonparametric 
frequentist inference. All these techniques use simu- 
lation to avoid tailoring analytical work to specific 
problems. 

Parallel with these developments has been the de- 
velopment of analytical approximations for paramet- 
ric inference in small samples, initiated by Fisher 
(1934) but largely overlooked until new developments 
were stimulated by Efron and Hinkley (1978) and 
Barndorff-Nielsen and Cox (1979). A flood of subse- 
quent work is summarized in the books of 
Barndorff-Nielsen and Cox (1994), Pace and Salvan 
(1997), and Severini (2000). The efforts of many re- 
searchers, particularly O. E. Barndorff-Nielsen, (1983, 
1986) and D. A. S. Eraser (e.g.. Eraser (1990); 
Eraser, Reid and Wu (1999)) and their co-workers, 
have led to an elegant theory of near-exact inference 
based on small samples from parametric models. Its 
theoretical basis is saddlepoint and related approx- 
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imation (Daniels, 1954, 1987), and further devel- 
opments have been well described by Reid (1988, 
1995, 2003). These methods are highly accurate in 
many situations, but are nevertheless under-used 
compared to the simulation procedures mentioned 
above. One reason for this may be their arcane ba- 
sis in the conditionality principle, ancillary statis- 
tics and marginalization, and another may be the 
forbidding technical details, but the main reason is 
undoubtedly the lack of suitable software. Unlike 
the bootstrap libraries available in general-purpose 
languages such as S-PLUS (S-PLUS (2007)) and R 
(R Development Core Team (2007)) or specialized 
software such as WinBUGS (Lunn et al. (2000)) or 
LogXact (Cytel Inc. (2007)), small-sample paramet- 
ric asymptotics have been implemented piecemeal, 
usually by specialists in the area for their personal 
use. 

This paper describes the implementation and use 
of libraries of software for higher order inference for 
several special classes of model: for linear exponen- 
tial families such as logistic regression models, for 
nonnormal linear models and for nonlinear regres- 
sion models with heteroscedastic normal errors. Its 
objective is to make higher order inference for such 
models available for use by those without a com- 
mand of the technical details. We also describe how 
Markov chain Monte Carlo may be used not only 
to assess conditional coverage and related proper- 
ties of some of our methods, but also for inference. 
A related, more extended, account may be found 
in Brazzale, Davison and Reid (2007), which gives 
many further examples. Butler (2007) gives ample 
evidence for the accuracy of the approximations that 
underlie some of the theory used herein. 

Section 2 outlines developments in parametric 
asymptotics that undergird the numerical approx- 
imations whose implementation is described in Sec- 
tion 3. Application to logistic regression is described 
in Section 4, where we argue that the conservatism 
and fragility of exact inference in this context should 
lead us to prefer approximation. In Section 5 we 
discuss regression-scale models with nonnormal er- 
rors, outline how both analytical approximation and 
Markov chain Monte Carlo simulation may be used 
for approximate conditional inference, and compare 
them empirically. Section 6 describes how the ap- 
proximate methods may be applied to nonlinear re- 
gression models, which are often fitted using small 
samples from bioassays or toxicological studies. The 
paper concludes with a brief discussion and appen- 
dices containing technical details. 



2. BACKGROUND 
2.1 First Order Inference 

Initially we consider a parametric statistical model 
with density /(y; 9), where € C M*^ is a d-dimen- 
sional parameter and y = (yi,...,yn) a vector of 
continuous responses. Let L{9) oc f{y;0) denote the 
likelihood and i{6) = log L{9) the log likelihood func- 
tions. Under mild conditions the maximum likeli- 
hood estimate 6 may be found by solving the score 
equation ie{d) = 0, and its asymptotic variance is 
approximated using the inverse of the observed in- 
formation matrix J (^). We distinguish between quan- 
tities of primary interest and others not of direct 
concern by writing 9 = (?/',A), where ■0 is a low- 
dimensional parameter of interest and A is a nui- 
sance parameter whose dimension may be apprecia- 
bly larger than that of 0. This partitioning entails 
corresponding splits of the score vector £0{tp, A) into 
£^(■0, A) and ix{tp,X), and of the observed informa- 
tion function j{tp, A) into the sub-matrices j^^ {tp, A) , 
Uxi'fpA), JA^(0,A) and jaa(0,A). 

Exact inference for linear exponential families and 
location-scale models was discussed by Fisher (1934) 
in a paper largely ignored for many years. Even 
where available, in principle, the effort needed to 
implement exact methods in all but the simplest 
cases means they are seldom used in practice, but 
are typically replaced by asymptotic approximations 
derived by assuming that the sample size n, or, more 
generally, some information index, tends to infinity. 
We then eliminate the nuisance parameter A by re- 
placing it by the constrained maximum likelihood 
estimate A^ obtained by maximizing A) with 
respect to A for fixed ip. Inference about ip may 
then be performed using the profile log likelihood 
function ip{ip) = inaxxi{tjj, X) =£{ip, X^). The cor- 
responding observed information function, jp{ip) = 
—d'^£p{'ip) /dip dtp^ , can be expressed in terms of the 
full observed information function through the iden- 
tity 

where 9^ = {ip,X^). For scalar ip, inference on the 
parameter of iriterest may be based on the Wald 
statistic, jpii'V^'^i''!' ~ V')) score statistic, 
{jp{'ip)}~^^'^itp{ip,X^), or on the likelihood root, 

(1) r{ij)=sign{i^-i;)[2{lp{,p)-£pm]'^^ 
which have standard normal distributions up to the 
order 0(n~^/^). When the sample size is small these 
first order approximations are often inaccurate, es- 
pecially in complex models. 
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2.2 Higher Order Inference 

The keys to refining the limiting behavior of the 
most important likehhood quantities are two higher 
order density approximations: Barndorff-Nielsen's 
(1983) p* formula and the tangent exponential model 
p^gj^j developed by Fraser, Reid and Wu (1999). 
Apart from an 0{n~^) norming constant, the first 
gives the density of the maximum likelihood esti- 
mate at the observed data and at other points hav- 
ing the same value of an ancillary statistic, though 
these must be known. The second is an exponential 
model whose distribution function at the observed 
data differs from that of the conditional model by 
0(n~^/^) under the observed conditioning 
(Fraser, Andrews and Wong (2005)). Both approx- 
imations are exact for transformation models and 
give excellent results generally. In full exponential 
families they agree with the saddlepoint approxima- 
tion to the density of the minimal sufficient statistic 
given by Barndorff-Nielsen and Cox (1979). 

These density approximations are mainly useful 
for deriving approximate distribution functions for 
appropriate pivots, from which we obtain P-values 
and confidence intervals for the parameters of in- 
terest. For scalar t/^, these approximate distribution 
functions have the forms 

(2) ^*{r)=^{r) + cl){r)(---\ 

\r qj 

and 

(3) $(r*) = $fr+-log-Y 

\ r r J 

for r given at (1) and q defined as 

where <&(•) and represent the standard normal 
distribution and density functions. Here, ip{9) is a 
reparametrization based at the observed data and 
used to provide a third order distribution function 
approximation through the tangent exponential 
model, and fe{6) and fx{9) represent the dx d ma- 
trix with element d^pi/dOj and the d x (d — 1) 
matrix with element d(pi/dXj. Special expres- 
sions for (4) can be found in Appendix A.l. Ex- 
pression (2) is known Lugannani-Rice-type ap- 
proximation, and the quantity r* in the Barndorff- 
Nielsen-type approximation (3) is known as a mod- 
ified likelihood root. Under ordinary repeated sam- 
pling, approximations (2) and (3) are exact up to 



the third order, that is, 

pr(i? <r;9) = $*(r) + 0{n~^^^), 

pr(ii* <r*;e) = $(r*) + 0(n~^/2). 

In comparison, the likelihood root r itself is standard 
normal only to the first order, 0(n~^/^). A rather 
subtle Taylor series expansion of <^(r*) around r, 
taking into account the dependence of f{9) on the 
observed data point, shows that $(r*) = $*(r) + 
0(n~^/^), rising to 0{n~^) if this dependence is not 
accommodated; in particular the more accurate re- 
sults holds for linear exponential families (Jensen 
(1992)), in which ip{9) does not depend on the ob- 
served data. In an exponential family of order one, 
<I>*(r) equals the Lugannani and Rice (1980) tail area 
approximation. In the presence of nuisance param- 
eters, it gives the approximation due to Skovgaard 
(1987). 

2.3 Related Ideas 

An alternative to the ideas outlined in Section 2.2 
is first to adjust the profile likelihood Lp^ip) = 
exp{^p('(/^)} to account for the presence of nuisance 
parameters, and then to correct the first order statis- 
tics obtained therefrom in order to improve the stan- 
dard normal approximation. Pierce and Peters (1992) 
call these sequential approximations, as contrasted 
with the more common double approximations 
(Barndorff-Nielsen and Cox (1979)). 

The general form of the adjusted profile likelihood 

is 

(5) La(V)=Lp(V')M(V), 

with suitably defined correction term M(^); see Ap- 
pendix A. 2. When an exact conditional or marginal 
likelihood function for ip exists, this is approximated 
to the order 0{n~^) by the adjusted profile likeli- 
hood function. In stratified models with the number 
of nuisance parameters proportional to the number 
of strata, Sartori et al. (1999) showed that M{ip) 
corrects for the presence of the nuisance parame- 
ters. The maximizing value V'a usually has a smaller 
finite-sample bias than does ip, and the likelihood 
root ra('i/') based on La has a distribution that is 
closer to normal than does r('0). Insight into why 
likelihood roots obtained from adjusted likelihoods 
tend to achieve most of the distributional improve- 
ment achieved by higher order methods, despite their 
null distribution being standard normal only to the 
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first order, is provided by Sartori (2003) in a two- 
index asymptotics setting and by DiCiccio and Efron 
(1996), who relate their findings to the bootstrap. 

If the parameter of interest is vector, formula (3) 
cannot be used. Skovgaard (2001) suggests adjust- 
ing the likelihood ratio statistic w{ip) = 2{ip{tp) — 
ip{ip)}, which has the distribution with do = 
dim(^) degrees of freedom up to the order 0(n~^). 
His proposed adjusted likelihood ratio statistic 

(6) w* = w{l — logu}'^ , 

with correction term u suitably defined, is also asymp- 
totically distributed as Xdo ' but behaves much better 
than w in small samples. If is scalar, then u = r/q 
with q given by (4), and (6) reduces to (r*)^. 

The above discussion applies only to continuous 
response models. For discrete responses, analogous 
results are in general unavailable. However, for dis- 
tributions such as the binomial and Poisson whose 
support has, or can easily be transformed to, a lat- 
tice structure, the use of a slightly modified form 
of (2) provides approximations to tail probabilities 
with error O(n-i) (Severini (2000), Section 6.3.3). 
Pierce and Peters (1992) discuss continuity correc- 
tion for the asymptotic approximations. As discussed 
in Section 4, the uncorrected version can be inter- 
preted as an approximation to the mid-P value 
(Agresti (2002), page 20) 

Pmid(a:; ip) = pr(X < x; ^) -h i pr{X = x; ip) 

for a suitable lattice random variable X. Further 
discussion is given by Pierce and Peters (1999), 
Davison and Wang (2002), and Davison, Fraser and 
Reid (2006), who indicate that use of r* unmodified 
in standard discrete cases approximates the mid-P 
value with error of the order 0{n~^). From a prac- 
tical point of view, the most reassuring point is per- 
haps not the precise asymptotic order of these ap- 
proximations, but rather the fact that they are rel- 
ative, and thus give accurate values even for small 
tail probabilities. 

3. IMPLEMENTATION 

3.1 General 

Many journal pages and several books have been 
devoted to the ideas sketched in Section 2, but their 
widespread adoption in practice has been limited by 
the lack of suitable software. The R package bundle 
hoa, short for higher order asymptotics, is intended 



to make these methods readily accessible by pro- 
viding easy-to-use and self-contained code for rou- 
tine data analysis with logistic regression models, 
nonnormal linear models and nonlinear models with 
nonconstant variance (Brazzale, 2005). These mod- 
els are widely used in applications: logistic regres- 
sion is a common tool in epidemiology and medicine; 
nonnormal linear models comprise models used in 
survival and reliability analyses; and nonlinear het- 
eroscedastic models are increasingly used in biostatis- 
tics, for instance, in herbicide bioassays and eco- 
toxicity tests. The code is organized as four pack- 
ages, three of which — cond, marg and nlreg — refer 
to the model classes just mentioned. A fourth — 
c sampling — contains conditional sampling routines 
for nonnormal linear models and was used to pro- 
duce the results presented in Section 5. The code is 
freely available from http://statwww.epfl.ch/AA 
or can be downloaded from CRAN (http://cran. 
r-project.org). 

The remainder of this section sketches the core 
ideas that make it possible to implement higher or- 
der asymptotics in a numerical computing environ- 
ment with limited facilities for algebraic manipu- 
lation. The issues inherent to the implementation 
for logistic and nonlinear models are described in 
Brazzale (1999) and Bellio and Brazzale (2003), re- 
spectively. The complete design strategy can be found 
in Brazzale (2000), Chapter 6. 

3.2 Building-Blocks 

The complexity of the algebraic expressions in- 
volved is the main obstacle to the implementation 
of higher order asymptotics in numerical computing 
environments, most of which have inefficient sym- 
bolic manipulation capabilities, if any at all. The 
key to our implementation of the methods presented 
in Section 2 is to identify building-blocks into which 
the higher order statistics can be decomposed and 
which are provided or can efficiently be handled by 
the computing device. This builds upon the obser- 
vation of Davison (1988) that in linear exponen- 
tial families the output of standard fitting routines 
suffices to calculate the q{tp) and M(ip) correction 
terms of Section 2. Brazzale (2000), Section 6.1, de- 
rives the corresponding building-blocks for nonnor- 
mal linear and nonlinear heteroscedastic regression 
models. The first of these classes is characterized by 
the design matrix X , the standardized residuals a, 
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minus the logarithm of the density function of the er- 
ror term, go{e) = — log /(e), and its first two deriva- 
tives (see Brazzale, Davison and Reid (2007), Sec- 
tion 8.6.2). For nonlinear regression models the only 
requirements are the mean and variance functions 
fj,{x]P) and w{x;(3,pY and their first two deriva- 
tives (see Brazzale, Davison and Reid (2007), Sec- 
tion 8.6.3). Two design strategies were adopted to 
make these quantities available in hoa: either they 
are provided by special constructs, called family ob- 
jects, or they are derived as needed by exploiting the 
algebraic manipulation function derivS available in 
R; see Bellio and Brazzale (2003). 

3.3 Pivot Profiling 

Inferences provided by fitting routines available in 
statistical computing environments such as S-PLUS 
and R are generally based upon first order asymp- 
totics. Most often the Wald statistic is used, because 
its linearity in the interest parameter ijj yields sim- 
ple readily computed confidence intervals. The like- 
lihood ratio statistic is parametrization- invariant, 
and hence more reliable, but its nonlinear ity in ip 
means that construction of confidence intervals en- 
tails re-fitting the model for all the required values of 
if). We deal with this by using cubic splines to inter- 
polate values of r('0) and related quantities among 
values calculated exactly for a grid encompassing 
the required range oi tp. In R this may be achieved 
by using different offsets, for each of which the nec- 
essary output is retrieved and statistics calculated. 
Estimates and confidence bounds are read off from 
significance functions such as (2) and (3) using the 
fitted splines. 

Numerical interpolation of higher order solutions 
works very well for analytic functions such as the 
profile and adjusted profile likelihoods, but quanti- 
ties such as r* have a singularity at = ip, and the 
numerical values calculated may be unstable if ^p is 
close to ip. This problem is particularly acute for lo- 
gistic regression. Nonnormal linear models are much 
less affected, and in our experience numerical insta- 
bilities are almost absent for nonlinear models. To 
avoid singularities in the first two model classes, we 
implemented a hybrid algorithm that uses two-step 
polynomial interpolation of the higher order statis- 
tics for values of V iii a small interval around ip. 
The higher order solutions are expressed as polyno- 
mials of the likelihood root r, which itself is written 
polynomial in ip, and the coefficients of these 



polynomials are estimated by least squares. For non- 
linear models, it suffices to avoid exact computation 
of the higher order statistics for values of ip very 
close to the maximum likelihood estimate. 

The procedure just described represents the bulk 
of the approximate conditional inference routines in 
the cond, marg and nlreg packages, which enable 
inference for the three model classes of the hoa bun- 
dle. See Section 6.3.2 and Appendix B.2 of Brazzale 
(2000) for further details. 

3.4 Markov Chain Monte Carlo 

The generation of observations conditional on an 
ancillary statistic is useful for inferential purposes 
such as the calculation of confidence intervals and 
P-values whenever the exact conditional density is 
unknown or difficult to obtain without simulation. 
Such an approach is described in an unpublished 
technical report by Casella, Wells and Tanner (1994), 
who emphasize sampling-based calculations for piv- 
otal inference using the Gibbs sampler. 

Conditional inference may also be used to assess 
the quality of small-sample solutions. Studies of the 
properties of the methods presented in Section 2 
(DiCiccio, Field and Fraser (1990), DiCiccio and 
Field (1991), Ronchetti and Ventura (2000)) focused 
on their numerical accuracy, stability and sensitivity 
to model failure. So far as we know, there has been 
no numerical investigation of these properties condi- 
tional on an ancillary statistic; Severini (1999) and 
Ventura (1997) grouped their simulation results by 
the levels of two nearly independent functions of the 
ancillary, but this does not amount to a fully con- 
ditional simulation. Trotter and Tukey (1956) were 
the first to simulate conditionally on an ancillary 
statistic in the special case of normal samples, 
but there have been few contributions since (Durbin 
(1961); Bondesson (1982); Fraser, Lee and Reid 
(1990); Morgenthaler and Tukey (1991)). 

The csampling package of the hoa bundle includes 
a conditional sampler for general regression-scale 
models and extends Bondesson's (1982) method by 
replacing an acceptance-rejection algorithm by the 
Metropolis-Hastings algorithm (Robert and Casella 
(2004), Chapter 7). Section 5.2 describes a simula- 
tion study performed using this package. 

4. LOGISTIC REGRESSION 

4.1 Likelihood Approximation 

After the linear model, logistic regression of bi- 
nary responses yi,...,yn on covariates {zi,xi), . . . , 
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{zn,Xn) is perhaps the most widely used paramet- 
ric regression procedm'e. Let X, Z and y denote the 
n X p, n X k and n x 1 matrices whose ith. rows are 
respectively xj , zj and i/i . The log likelihood 

n 

- X! 1 + exp(z[ V + a^I A) } 

i=l 

corresponds to a linear exponential family with canon- 
ical parameter {^p, A) and sufficient statistic {t, s) = 
{Z^ y,X^ y), so the higher order approximations de- 
scribed above are particularly simple and can be ob- 
tained by re-arranging the output of standard rou- 
tines for fitting logistic models (Davison (1988)); 
see also Daniels (1958), Barndorff-Nielsen and Cox 
(1979), Pierce and Peters (1992) and Strawderman, 
Casella and Wells (1996). Such approximations are 
provided by our cond package. 

Example 1 {Urine data). For illustration, we 
take data on the presence or not of crystals in urine 
samples (Andrews and Herzberg (1985), page 249). 
Pull data on the six quantitative covariates are avail- 
able for 77 individuals, and we consider the coeffi- 
cient ij) of the variable urea representing urea con- 
centration (millimoles/litre) in a logistic regression 
model also containing the five other covariates and 
an intercept. The R code for first order and higher 
order inferences is 

> uri.glm <- glm( r ~ gravity + ph + 

+ osmo + conduct + 

+ urea + calc, 

+ family = binomial, 

+ data = urine ) 

> summary ( uri.glm ) 

> uri.urea <- cond( uri.glm, 

+ offset = urea ) 

> summary ( uri.urea ) 

> plot( uri.urea ) 

The first two instructions fit the model by maxi- 
mum likelihood and then print the results, and the 
last four lines of code compute, summarize and plot 
the first and higher order approximations. The max- 
imum likelihood estimate and its standard error are 
^ = -0.0320 (0.0161), yielding a Wald statistic of 
-1.99 with two-sided P-value 2^>(-1.99) = 0.047. 
An approximation to the conditional maximum like- 
lihood estimate and its standard error is obtained by 
maximizing the adjusted profile log likelihood and 
taking its curvature at the maximum; this gives V'a = 



—0.0276 (0.0149), yielding an approximate condi- 
tional Wald statistic of —1.85 and P-value 0.064. 
The 95% confidence intervals for tp based on these 
two Wald pivots are respectively (—0.0636, —0.0004) 
and (—0.0568,0.0016), and those based on the like- 
lihood root r, and the modified likelihood root r* 
are (-0.0668,-0.0025) and (-0.0587,0.0005). Thus, 
the estimated coefficient changes by around 14%, 
more than might be anticipated with six nuisance 
parameters and 77 observations, and there are cor- 
responding changes to the confidence intervals. 

Figure 1 shows two of the graphs provided by the 
plot command: note the large difference between 
first and higher order inference summaries, which 
suggests that the latter should be used as a mat- 
ter of course with binary data models. The adjusted 
profile likelihood corrects for the finite sample bias 
in the maximum likelihood estimator, in analogy to 
the conditional likelihood function, while the modi- 
fied likelihood root also accounts for the nonnormal- 
ity of the ordinary and adjusted likelihood functions. 

A simple information computation sheds some light 
on the size of the higher order correction in the ex- 
ample above. Suppose that we have independent ob- 
servations y[,...,y'n from the logistic density exp(?/^ — 
A — Ziijj)/{1 + exp(y^ — A — Ziijj)}'^, where the Zi are 
known scalar covariates. The asymptotic variance 
■ycont of the maximum likelihood estimator of ip based 
on the continuous is a corner of the Fisher infor- 
mation matrix, which is easily seen to be 3{X^ X)~^ , 
where X denotes the entire design matrix, whose 
ith row here is (l,Zi). If only the sign of the y'^ is 
known, so the continuous observations are replaced 
by binary variables, the asymptotic variance of the 
maximum likelihood estimator of i/' is a corner of the 
matrix (X^ W X)~^ , where W denotes the n x n di- 
agonal matrix diag{7ri(l — vri), . . . , 7r„(l — 7r„)}, with 
TTj = exp(A -|- Zi%l))/{1 + exp(A + ZjV)}^- The ratio of 
these asymptotic variances gives some idea of the 
information content of the binary data compared 
to the continuous data. Figure 2 shows this ratio 
as a function of the standardized parameter Scout = 
^/(ucont)^''^ when the covariate z takes n = 21 equi- 
spaced values ranging from —3 to 3. The maximum 
value 0.75 occurs when X = tp = 0, but the ratio 
drops fast as 5cont increases. A value of Jcont = 5 that 
would be easily distinguished from using the con- 
tinuous data would correspond to a value of around 
2 based on the binary data, and this would be much 
less easily distinguished from zero. 
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Standardized parameter 

Fig. 2. Efficiency of logistic regression for estimation of 
relative to estimation with corresponding continuous responses 
from the logistic distribution, as a function of the standardized 
parameter S for the continuous model, for A = (solid), A = 1 
(dashes) and X = 2 (dots). 

If the same computation is applied to the urine 
data, then the numbers of continuous observations 
equivalent to the 77 binary observations range from 
7 to 19, depending on the parameter considered, 
with a value of 16 or so for ^lJ. In this light the 
difference between first order and higher order re- 
sults seems much more explicable: we are fitting a 
model with 6 nuisance parameters to the equiva- 
lent of fewer than 20 continuous observations, and 
so one would expect an appreciable "degrees of free- 
dom" adjustment. Section 4.2 of Brazzale, Davison 
and Reid (2007) gives related discussion. 



4.2 Exact Inference 

In a logistic regression model, exact inference for 
the interest parameter ^0 is available from the con- 
ditional density function of T given the value of S, 



(7) pr{T = t\S = s;^) 



exp(7/"' Zt/;) 



where Ag = {y.y^ X = s,y ^ {0, l}"-}. The function 
(7) does not depend on A, and this greatly sim- 
plifies inference (Cox (1958)). The main practical 
difficulty in using (7) is the enumeration of the el- 
ements of As, but recent computational advances 
have brought this into reach, at least in simple cases. 
One possibility is to use the network algorithm 
(Mehta and Patel (1995), Mehta, Patel and 
Senchaudhuri (2000)) provided by commercial soft- 
ware such as LogXact, but although helpful in sim- 
ple problems, it can be impracticably slow when 
there are several covariates. Forster, McDonald and 
Smith (2003) propose a Markov chain Monte Carlo 
algorithm for more complex models, but as their 
chain may be reducible, there is no guarantee that As 
would be fully explored even if the chain were to be 
run forever. Their algorithm has been 
implemented in the elrm package of R by 
Zamar, McNeney and Graham (2007). 

Apart from the enumeration of As, there are two 
deeper problems, both linked to the exactness of 
(7): the conservatism of exact tests and confidence 
intervals, which leads to overly-wide intervals and 




-0.08 -0.04 0.00 -0.08 -0.04 0.00 

Coefficient of urea Coefficient of urea 

Fig. 1. Graphical output comparing first (solid) and higher order (dashes) inference summaries for the urine data. Left: 
profile log likelihood £p{tp) (solid) and adjusted profile log likelihood lai^jj) (dashes), with horizontal gray lines indicating 0.95 
and 0.99 confidence sets for xp. Right: likelihood root r^ijj) (solid, curved) and modified likelihood root r''{il') (dashes, curved), 
and Wald pivots based on the profile likelihood (solid, straight) and the adjusted profile likelihood (dashes, straight). The 
horizontal gray lines are at the 0.025, 0.5 and 0.975 standard normal quantiles. 
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-7 -6 -5 -4 -3 -2 -1 -1 -6 -5 -4 -3 -2 -1 

t t 

Fig. 3. Exact conditional distribution ofT in logistic example (step function), with "3?{r*(t)} (solid), ${r(t)} (dashes), and 
${r*(t + 1/2)} (dots), shown for t = —6, .... 0. Left: -0 = 0. Right: — 0.05. The horizontal gray lines are at 0.025. 



overly-large P-values, and the fragility of exact con- 
ditional inference in certain discrete cases. We now 
discuss these, illustrated by data with n = 16 binary 
responses: 

y"" = (1 1 1 1 1 
1 0), 

covariate matrices 

T_l/2 2 2 2 2 2 2 
~2V-3 -3 -3 -3 -1 -1 -1 

2 2222222 2\ 
-111113 3 3 3/' 

Z'' = -(-3 -113-3-113 
2 ^ 

-3-113 -3 -1 1 3), 

and corresponding sufficient statistics s = y = 
(6, —3) and t = Z"^ y = —4. In this case the full sam- 
ple space has 2^^ elements, reducing to 8008 and 13 
elements respectively when conditioned on the first 
component of s and on both components of s. We 
use this example in the next two subsections. 

4.3 Conservatism of Exact Inference 

Exact inference in discrete response models typi- 
cally leads to conservative tests and confidence inter- 
vals. A striking illustration of this in the context of 
simple binomial models is given by Agresti and Coull 
(1998), who show how exact intervals such as that 
due to Clopper and Pearson (1934) are conservative 
for all values of the underlying parameter, while ap- 
proximate intervals based on likelihood quantities 
have overall coverage closer to the nominal level. 



For a variety of viewpoints on this and some 
solutions, see Agresti and Caffo (2000), 
Brown, Cai and DasGupta (2001) and Geyer and 
Meeden (2005). 

In the case of our simple example. Figure 3 shows 
the lower tail of the exact conditional distribution 
function of T when if: = {), obtained by comput- 
ing the generating function for the combinatorial 
terms; it has support on the set {— 6, — 5, . . . , 6}. 
Also shown are the approximate conditional distri- 
butions obtained by taking ${r(t;0)}, ${r-*(t;0)}, 
and <I>{r*(t + 1/2; 0)}, for a grid of values of t in the 
range (—6,6); the corresponding datasets were con- 
structed to retain the original value of the condition- 
ing statistic s. These approximations correspond re- 
spectively to first order and higher order procedures, 
and to use of the higher order procedure with a con- 
tinuity correction. Use of the function ${r*(t; -0)} 
for tp = {) yields a continuous approximation to the 
exact conditional distribution function that closely 
matches the mid-points of the jumps in the step- 
function and thus the mid-P value. 

Table 1 compares P-values and confidence limits 
for these data. The results for mid-P and the modi- 
fied likelihood root are fairly close, and give tighter 
inferences than does the exact solution, which is 
close to the modified likelihood root, plus continuity 
correction. The kink at t = in the approximations 
involving r* is due to a numerical instability. Al- 
though a different expression given as a limit for 
r ^ is available, it is rarely used in practice be- 
cause errors in P-values that are close to 0.5 are 
unimportant. 
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Table 1 

One-sided P-values for testing ijj = and limits of nominally 
equi-tailed 95% confidence intervals for ip, for the artificial 
logistic regression example 





P-values 


Limits of confidence 
interval 


Exact 


0.0528 


(-2.992, 0.158) 


mid-P 


0.0346 


(-2.690, 0.069) 


Wald 


0.0399 


(-2.572, 0.144) 


Wald, modified 


0.0475 


(-2.290, 0.183) 


Likelihood root, r 


0.0190 


(-2.950, -0.060) 


Modified likelihood root, r* 


0.0318 


(-2.506, 0.050) 


with continuity correction 


0.0557 


(-2.906, 0.172) 



The equi-tailed exact confidence interval {ip-,ip+) 
with level (1 — 2a) has limits given by the solutions 
to the equations 

pr(T > t \ S = s; ip-) = a, 

(8) 

pr(T <t\S = s; ip^) = a, 

whereas the limits of the intervals based on r and 
r* are the solutions in tp of the equations 

${r(t;^)} = a,l- a, ^{r*{t;ip)} = a,l-a, 

respectively. The right-hand panel of Figure 3 shows 
the conditional distribution for T for tp = 0.05, which 
for t < slightly depresses the probabilities rela- 
tive to taking ip = 0, and illustrates why the exact 
intervals are wider, and hence more conservative, 
than are these approximate ones: it is necessary to 
take V+ > 0.05 to satisfy the right-hand equation 
in (8); in fact, the first line of Table 1 shows that 
ip^ = 0.158 is required. 

4.4 Fragility of Exact Conditional Inference 

The second issue is the sensitivity of the set As, 
and hence of exact conditional inference, to the ma- 
trix X. It seems reasonable to require that small 
changes to X, for instance, due to rounding of the 
explanatory variables, should lead to small changes 
in confidence intervals and P-values. To test this, we 
jittered the second column X2 of the matrix X in the 
simple example of Section 4.2. When the elements of 
X2 were perturbed by adding normal noise with stan- 
dard deviation 0.01, rounded to 3 decimal places, the 
value of s changed to (6,-3.013), and the support 
points of the conditional distribution reduced from 
{—6, —5, . . . , 6} to {—4, —2, 0}. The exact tail prob- 
ability for t, pr(r < t \ S = s'jip = 0), changed from 



0.0528 to 0.3333, and mid-P from 0.0347 to 0.167, 
but ^{r*{t)} changed only from 0.0318 to 0.0316. 

An attempt to assess the fragility of the inference 
for the urine data failed: when the covariates are 
scaled to zero mean and unit variance, and rounded 
to the nearest integer, giving 5-6 rounded values 
for each covariate, one million iterations of the al- 
gorithm described by Forster, McDonald and Smith 
(2003), designed to enumerate the conditional sam- 
ple space for the urea effect, found 13 support points. 
A Markov chain run with rounding to the first deci- 
mal place failed to move at all, suggesting that, with 
this degree of precision in the covariates, the condi- 
tional distribution for the urea effect is degenerate. 
Thus, exact conditional inference seems to be out of 
reach for these data. 

To compare more systematically the effects of per- 
turbing the covariate on exact and approximate con- 
ditional inferences, we repeated this experiment 1000 
times, by adding noise with standard deviation 0.01 
to X2, and rounding to different precisions. Table 2 
gives the sizes of the resulting conditional sample 
spaces. Small changes to the covariates may sharply 
change the conditional sample space, and thus may 
severely affect exact conditional inferences and de- 
rived quantities such as mid-P values, but the ap- 
proximate conditional inferences barely change: for 
each level of rounding, the average values of r and 
r* were —2.076 and —1.855 across the 1000 simu- 
lated datasets, with standard errors of around 0.004 
and 0.0035 for all levels of rounding; the values of r 
and r* are of course constant when rounding to one 
decimal place. The mid-P values are computed with 
respect to the exact distribution, and hence are very 
sensitive to changes in the covariates; the 'approx- 
imating mid-P value' <l>(r*) might in such cases be 

Table 2 

Changes in number of support points of conditional sample 
space for a logistic regression model when a covariate is 
perturbed by adding small amounts of noise, rounded to 
different numbers of decimal places. As the number of 
decimal places increases, the conditioning becomes 
increasingly restrictive 



Decimal 


Number of support point 


s of conditional sample space 


places 


1 2 3 4 5 6 


7 8 9 10 11 12 13 


1 




1000 


2 


9 8 16 13 33 50 


65 103 150 206 181 138 28 


3 


129 212 191 167 141 101 


38 15 4 1 1 


4 


730 221 46 3 
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regarded as having been computed from an approxi- 
mating continuous distribution (Davison and Wang 
(2002)). 

These results supplement the finding of 
Pace, Salvan and Ventura (2004) that rounding of 
the response has little effect on higher order likeli- 
hood procedures. 

5. REGRESSION-SCALE MODELS 

5.1 Exact Inference 

Nonnormal linear models, also known as regression- 
scale models, belong to the wider class of transfor- 
mation models. They may be written using matrix 
notation as 

(9) y = X(5 + ae, 

where X is a fixed nxp design matrix with unknown 
regression coefficient /3 € M^, a > is a scale parame- 
ter, and £ = (ffi, . . . , e„) represents an n-dimensional 
vector of errors which are independently and iden- 
tically distributed according to a known though not 
necessarily normal density /(•) on M. If the maxi- 
mum likelihood estimates (/?, a) exist and are finite, 
there exists a one-to-one change of variables from 
y = {yi,---:yn) to 0, a, a), where ai = [yi -xl(5)/a, 
i = l,...,n, are the standardized residuals of the 
model and xj is the ith row of X. The pair {P,(t) 
forms a transformation variable, whereas the vector 
of standardized residuals a = (ai, . . . , a„) is ancillary 
with respect to P and a. As shown by Fraser (1979), 
Section 6.1.5, the functionally unique separation of 
(3 and a is obtained from the pivots Zi = [j3 — j3) / a 
and Z2 = (t/ct, whose joint distribution given a is 
known up to a normalizing constant. Fisher (1934), 
Fraser (1979) and others suggest that inference on 
the parameters (/?, a) should be made conditionally 
on a. Conditional confidence intervals for single pa- 
rameters are based on the marginal densities of the 
corresponding pivots, obtained by integrating out 
the remaining components in 

fZi,Z2\a{zi,Z2 I a) 

= c{a)zr^X{f{{x]z^+a,)z2}\X'' X\^/\ 

i=l 

Lawless (1972, 1973, 1978) gives applications to 
Cauchy, logistic, Weibull, and extreme value dis- 
tributions and, Kappenmann (1975) to the Laplace 
distribution under a location-scale model. 



Exact calculation of the marginal distribution for 
the pivots of interest usually involves multidimen- 
sional numerical integration, and can be difficult. 
For instance, the normalizing constant is given by 




n 

■'[lf{{xjzi+ai)z2} 
1=1 

■\X^X\'/^dzu 

■ ■ ■ dzipdz2, 

where zu = {Pi — Pi) /a. The required computational 
effort rapidly becomes infeasible, especially if the 
number of parameters is large and the dimension of 
the interest parameter is low. There are two ways 
to overcome this problem. The first is to use the 
higher order theory presented in Section 2, which 
applies rather naturally to regression-scale models. 
The tail area approximations (2) and (3) agree with 
those proposed by DiCiccio, Field and Fraser (1990) 
and by Fraser, Lee and Reid (1990) for the marginal 
distribution functions of the pivots Zi and Z2. All 
these methods are available through the package 
marg of the hoa bundle, which is equivalent in its 
design, syntax and use to the cond package. Ex- 
amples of application are given in Section 5.2 of 
Brazzale, Davison and Reid (2007) and in Section 
5.3.2 of Brazzale (2000). 

The second way to avoid numerical calculation of 
the normalizing constant in (10) is to use Markov 
chain Monte Carlo (MCMC) techniques to simulate 
from the conditional distribution. 

5.2 Monte Carlo Simulation 

Classical simulation techniques generate observa- 
tions that are independent and identically distributed 
by sampling directly from the target density. In our 
case this is not possible, as the normalizing con- 
stant c(a) in (10) is generally unknown. Among pos- 
sibilities for dealing with this are the conditional 
sampler available through the rsm. sample routine 
of the csampling package, which implements the 
Metropolis-Hastings algorithm. This routine sam- 
ples not from the conditional density (10) of the 
pivots, but from that of the maximum likelihood 
estimates (/3,ct), namely, 

//3,^|a(A^I«;/3,t^) 



ACCURATE SMALL-SAMPLE INFERENCE 



11 



(11) =c(a) 



n—p—l ^ 



a" 



i=l 



5.3 Conditional Distribution of Pivotal 
Quantities 

We considered a sample of size re = 10 from a lin- 
ear regression model with the 10 x 6 design matrix 





" 0.686 


0.640 


0.908 


0.886 


0.508 


Because of the one-to-one relationship between the 


0.566 


0.632 


0.130 


0.480 


0.669 


maximum likelihood estimates and the pivots (^1,^2) ^ 
given a, both approaches yield the same results, but ~ 
































sampling from (11) makes it easier to investigate the 

















distributions of higher order statistics. The pseudo- 


. 1 


1 


1 


1 


1 


code for the conditional sampler may be written as: 


0.255 


0.197 


0.056 


0.646 


0.317- 




0.930 


0.869 


0.204 


0.961 


0.321 


• Choose a candidate generation density /c(-). 


1 


1 


1 


1 


1 


• Choose an initial value (/3o,CTo)- 


0.255 


0.197 


0.056 


0.646 


0.317 


• For t=l,...,r 


0.930 


0.869 


0.204 


0.961 


0.321 




1 


1 


1 


1 


1 . 



1. Generate {Pc,^c) from /c(-). Take 



(/3c, ac) 

with probability tt, 
{/3t-i,at-i) 

with probability 1 — tt. 



where 



. ( fif3c,^c\ a; Po, (To) fc0t~i,^t~i) 
TT = mm< — , 

I /(/3t_i,o-t_i I a;/3o,cro)/c(/3c,o-c) ' 

a is the ancillary and {Pq,(7q) the 
simulation parameters . 
2. Reconstitute the sample yt — (j/it,-- 
where yu ^ x] $t + ^tai. 



The main implementation issue is the choice of /c(-). 
We found it best to make the transformation log a, 
giving the target density support RP'^^, and to sam- 
ple from a multivariate Student t distribution with 
low degrees of freedom and with shape close to that 
of the target density. Details can be found in Brazzale 
(2000), Chapter 7. 

The following two subsections summarize the re- 
sults of a study inspired by Example 3 of 
DiCiccio, Field and Fraser (1990). The rsm. sample 
routine was used to retrieve the empirical distribu- 
tion of the pivots Zi = {P — P)/a and = ct/o- for a 
fixed value of the ancillary statistic, and to investi- 
gate the empirical accuracy of the tail area approx- 
imations (2) and (3). 



and errors that follow the standard log-Weibull dis- 
tribution. This is a rather extreme scenario, with 
only 4 residual degrees of freedom and with highly 
correlated estimators of the regression coefficients. 
The sample configuration a on which to condition 
was chosen by random sampling from the standard 
log-Weibull distribution using the same parameter 
values as in DiCiccio, Field and Fraser (1990). We 
repeated the study for various choices of a, all of 
which yielded similar results. The candidate gener- 
ation density — a multivariate distribution — was 
rescaled and centered so as to optimize the accep- 
tance rate, of about 25% and 30%, as was assessed 
in a pilot study. According to Corollary 4 of Tierney 
(1994), the resulting Markov chain is uniformly er- 
godic. The sampler was run for T = 100,000 itera- 
tions, reached stationarity very quickly, and mixed 
well. The corresponding R code is given in the demon- 
stration file for the csampling package. 

Figure 4 shows the conditional distributions of the 
pivots Zi3 = {$3 — (3z)/a and logZ2 = \og{a/a) for 
a particular choice of the sample configuration a. 
Both distributions are notably nonnormal; the finite 
sample distribution of log Z2 is, furthermore, heavily 
biased. Non-normal distributions were also observed 
for the remaining five regression coefficients. Table 3 
compares the exact distribution functions of the two 
pivots with the approximations obtained from the 
likelihood root r and the modified likelihood root 
r*. The first order approximation performs rather 
poorly, especially for the scale parameter, while the 
third order solution competes well in this rather ex- 
treme scenario. 
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Table 3 

Exact and approximate tail probabilities for the pivots Z\z and log Z2 in the log- Weibull regression model with n = 10 and 
p = 6 considered by DiCiccio, Field and Eraser (1990), Example 3, for a particular choice of a. The approximations 
considered are <I>(r-) and <i>(r*), where r and r* are the likelihood root and its third order modification. The exact distribution 
was generated by the Metropolis-Hastings sampling (100,000 iterations, burn-in 5000) 



Pr(Zi3 < z) 



z 


-52.3 


-38.6 


-29.0 


-20.1 


21.8 


30.6 


40.6 


55.3 


exact 


0.01 


0.025 


0.05 


0.1 


0.9 


0.95 


0.975 


0.99 


$(r) 


< 0.001 


< 0.001 


0.001 


0.008 


0.996 


0.999 


1.000 


1.000 


$(r*) 


0.006 


0.016 


0.035 


0.082 


0.913 


0.962 


0.984 


0.995 










Pr(log Z2 










z 


-2.13 


-1.89 


-1.69 


-1.46 


-0.30 


-0.16 


-0.05 


0.07 


exact 


0.01 


0.025 


0.05 


0.1 


0.9 


0.95 


0.975 


0.99 


$(r) 


< 0.001 


< 0.001 


< 0.001 


< 0.001 


0.170 


0.295 


0.433 


0.586 




0.003 


0.009 


0.021 


0.052 


0.837 


0.912 


0.954 


0.979 



5.4 Accuracy of Higher Order Approximations 

Table 4 of DiCiccio, Field and Fraser (1990) re- 
ports the overall rates of noncoverage of the one- 
sided confidence intervals obtained from r and <&* 
for the parameters /?3 and logo" with a simula- 
tion of size 1000. However, as the authors themselves 
remark, these assessments are in terms of uncondi- 
tional rather than conditional coverage. 

Our Table 4 extends Table 4 of DiCiccio, Field 
and Fraser (1990): it gives the conditional rates of 
noncoverage of upper and lower confidence limits for 
the parameters /^s and logo" obtained from the 
signed likelihood root pivot r, and the third order 
tail area approximations (2) and (3), for a particular 
choice of a. For the regression coefficients, the like- 
lihood root yields confidence intervals which are too 
short, while the two higher order pivots work well. 
First order confidence intervals for logo" are heavily 



biased. Furthermore, we observed the feature men- 
tioned by DiCiccio, Field and Fraser (1990): the tail 
area approximation $*(r) breaks down. In about 
two-thirds of the samples the tail area exceeds 1. 
This is a drawback of Lugannani-Rice-type approx- 
imations, which need not give values within the in- 
terval (0,1). The modified likelihood root r* does 
not suffer from this drawback and provides satisfac- 
tory values. Some insight into why this happens is 
provided by Figure 5, which shows the normal Q~Q 
plots of r and r* for fi/^ and a. The finite-sample dis- 
tribution of r(o") is heavily biased with respect to the 
standard normal, whereas the tails of the distribu- 
tion of r(/34) are too heavy. As noted in the previous 
paragraph, the conditional distributions of the max- 
imum likelihood estimators are far from normal, so 
first order asymptotics are not useful. Surprisingly, 
r* works well for all seven parameters, especially 
since there are just n = 10 observations. 



0.025 

0.020 

0.015 H 

0.010 

0.005 

0.000 
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Fig. 4. Histograms of the pivots Zri and Z2 generated by the Metropolis-Hastings sampling (100,000 iterations) . Only every 
50th value is taken, after having discarded an initial sequence of length 5000. 



ACCURATE SMALL-SAMPLE INFERENCE 



13 



Normal Q-Q Plot (p*) Normal Q-Q Plot (o) 




-4 -2 2 4 -4 -2 2 4 

Theoretical Quantlles Theoretical Quantiles 



Fig. 5. Normal Q-Q plots of r (solid line) and r* (bold Ime) for jSi (left panel) and a (right panel) obtained from the 
Metropolis-Hastings sampling (100,000 iterations) , with diagonal line indicating perfect fit. The Q-Q plots only use every 
50th simulated value, after having discarded an initial sequence of length 5000. 



6. NONLINEAR MODELS 

Nonlinear models are widely used in applied statis- 
tics, especially for modeling dose-response curves. 
We consider the general form 

Vij = fi{xi;P) + w{xi;(3,p)£ij, 



(12) 

i = l,...,m,j = l,...,ni, 

where m is the number of design points, rii the num- 
ber of replicates at design point Xj, yij represents 
the response of the j'th experimental unit in the ith 



Table 4 

Conditional rates of noncoverage of confidence intervals for pi , jS^ and log a in the log- Weibull regression model with n = 10 
and p = 6 considered by DiCiccio, Field and Fraser (1990), Example 3, for a particular choice of a. The tail area 
approximations used are based on the likelihood root, $(r), on its higher order modification, 4>(r'*), and the third order 
quantity, $*(r). The coverages were calculated using a Metropolis-Hastings chain of length 100,000 after having discarded 
the first 5000 values. The coverages of confidence intervals for logo" using $*(r) are not given, as this statistic breaks down 





Nominal 




Upper confidence limit 






Lower confidence limit 




*(r) 


*(r*) 


**(r) 


*(r) 


*(r*) 


#*(r) 


Pi 


0.5 


7.96 


0.91 


0.41 


11.16 


0.89 


0.33 




1 


10.53 


1.71 


0.85 


14.05 


1.58 


0.52 




2.5 


15.28 


3.65 


2.24 


18.79 


3.36 


1.29 




5 


19.61 


6.37 


4.45 


23.07 


6.19 


2.81 




10 


25.19 


11.88 


9.77 


28.87 


11.76 


6.60 




25 


36.19 


25.98 


25.22 


39.05 


26.71 


23.63 




0.5 


8.29 


0.86 


0.35 


10.89 


1.03 


0.40 




1 


10.64 


1.73 


0.79 


14.29 


1.72 


0.67 




2.5 


15.15 


3.89 


2.22 


19.23 


3.69 


1.69 




5 


19.43 


6.66 


4.69 


23.58 


6.11 


3.32 




10 


25.24 


11.82 


9.63 


29.61 


11.30 


6.76 




25 


35.68 


26.24 


25.66 


40.05 


25.50 


21.83 


logo- 


0.5 


44.38 


1.53 




0.00 


0.34 






1 


53.46 


2.69 




0.00 


0.51 






2.5 


65.68 


5.66 




0.00 


1.17 






5 


75.14 


9.53 




0.01 


2.64 






10 


83.66 


17.10 




0.06 


5.97 






25 


93.46 


35.88 




0.40 


16.85 
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group, and the errors Eij are independent A^(0, 1) 
variates. The mean response is given by the nonhn- 
ear function fj,{xi;/3), which depends on a vector of 
unknown regression coefficients (3, while the function 
'w{xi; /3, p) may also depend on a vector p of variance 
parameters. If w{-)'^ is constant, (12) becomes the 
classical nonlinear regression model. Inference on (5 
and p is commonly based on first order approxima- 
tions and linearization techniques (Seber and Wild 
(1989), Chapter 5), plus graphical summaries such 
as profile and contour plots (Bates and Watts (1988), 
Section 6.1), which allow one to assess the quality of 
distributional approximations for the likelihood ra- 
tio and Wald statistics. Bellio and Brazzale (1999) 
show that nonlinearity of the mean function and 
variance heterogeneity can lead to substantial inac- 
curacies in first order inferences, especially for the 
variance parameters, unless the sample size is large. 
This may be overcome using the higher order so- 
lutions presented in Section 2, which are relatively 
easily derived in this case (Bellio, Jensen and Seiden 
(2000)). To do so, we re-write (12) curved expo- 
nential family of order (2m, d), where d is the dimen- 
sion of the parameter 6 = (/3, p). Expressions for the 
quantities needed to calculate the correction terms 
g(V'), M{ip) and u(^) are listed in 
Brazzale, Davison and Reid (2007), Sections 8.6.2 
and 8.6.3. We now present results of a data analysis 
performed with the nlreg package of the hoa bun- 
dle. Further examples may be found in 
Brazzale, Davison and Reid (2007), Chapters 5 and 
6. 

Example 2 (Herbicide bioassay). Data set C3 
of the nlreg package concerns an in vitro bioas- 
say on the action of the herbicide chlorsulfuron on 
the callus area of colonies of Brassica napus L., also 
known as oilseed rape. The experiment is described 
in Seiden, Kappel and Streibig (1998) and consists 
of n = 51 measurements of the callus area (in mm^) 
for m = 10 different dose levels (in nmol/1). The de- 
sign is unbalanced, as the number of replicates per 
dose varies from 5 to 8. Bellio, Jensen and Seiden 
(2000) discuss a model where the response variable 
is the logarithm of the callus area and the mean 
function is the logarithm of the four-parameter lo- 
gistic function 

(13) M^;/3)=/3i+ /^'f^ rr>0. 

This yields a sigmoidal curve which decreases from 
an initial value (32 to a limiting value (3i when the 



concentration x tends to infinity. The parameter /Js 
determines the shape of the curve, and (3^ corre- 
sponds to the EC50. A preliminary data analysis 
further suggests that the response variance might 
slightly decrease with its mean. The same authors 
suggest using the error-in- variable variance function 



w{x; P, K, 7, a 



2\2 



(14) 



K,7,cj^ >o, 



start 



control 



where k, 7 and cj^ are variance parameters. 

The R code for first and higher order inference for 
the nonlinear model defined by (13) and (14) is 

> C3.nl <- 

+ nlregCf ormula = log(area)~ 

log(bl+(b2-bl)/ 
(l+(dose/b4)'b3)) 
weights = ~ (l+( (k*dose~g* 
(b2-bl)-2)/ 
(l+(dose/b4)-b3)-4* 
b3~2*dose~(2*b3-2))/ 
b4- (2*b3) / (bl+ (b2-bl) / 
(l+(dose/b4)-b3))~2) , 
c(bl=2.2, b2=1700, 
b3=2.8, b4=0.28, 
g=2.7, k=l), 
= list(x.tol=le-12, 

rel.tol=le-12, 
step .inin=le-12) , 
data = C3, hoa = TRUE ) 

> summary ( C3.nl ) 

> C3.prof <- profileC C3.nl, 

+ offset = "all" ) 

> contourC C3.prof , offsetl = b2, 

+ offset2 = k, alpha = 0.95 ) 

> summary ( ria.prof, twoside = TRUE ) 

The formula and weights arguments in the nlreg 
fitting routine determine respectively the mean and 
variance functions p{-) and w{-)'^. The maximum 
likelihood estimate of o"^ is available in closed form, 
but starting values for the other parameters must be 
provided through the start argument. All compu- 
tations for o"^ use the logarithmic scale. Because of 
the highly nonlinear model structure, we must re- 
fine the convergence criteria through the control 
argument. We obtain f3i = 2.206 (0.415), (32 = 1662 
(117), /33 = 2.841 (0.360), /34 = 0.2752 (0.0452), 7 = 



2.605 (0.793), k = 1.009 (0.580) and logo"^ = -1. 
(0.234). The values in brackets are the standard er- 
rors, as returned by a call to summary. The option 
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hoa=TRUE indicates that higher order solutions will 
be used in the subsequent calculations. 

The profile and contour methods extend the 
original algorithm of Bates and Watts (1988), Chap- 
ter 6, to the higher order solutions presented in Sec- 
tion 2. The core routine is based upon pivot profiling 
as described in Section 3.3. By default, it computes 
the higher order statistics developed by Skovgaard 
(1996, 2001), although Fraser, Reid and Wu's (1999) 
version is available by setting stats="fr". The op- 
tion of f set="all" means that all model parameters 
are to be profiled. Figure 6 shows the 95% approxi- 
mate bivariate contour plots of the Wald, likelihood 
ratio and w* pivots for the parameters /?2 and k. 
The contours are plotted on the original scale (right 
panel) and on the r scale (left panel), respectively. 
On the latter scale the units are those of the likeli- 
hood root statistics. The more elliptical the contours 
are, the more quadratic is the likelihood; the closer 
the curves for w and w* , the better the behavior of 
first order inferences. The profile traces also shown 
represent the constrained maximum likelihood esti- 
mates of one parameter as a function of the other 
and show how the estimates affect each other. If the 
asymptotic correlation is zero, the angle between the 
traces is close to n/2, while an angle close to zero 
indicates strong correlation. 

Two-sided confidence intervals can be obtained 
using the summary function. The 95% confidence 
intervals for the parameter k are (—0.1270,2,145), 
(0.2697,3.096), (0.3264,3.546) and (0.3321,3.803) for 
respectively the Wald, r and r* pivots, these last 
computed using Skovgaard's (1996) and 
Fraser, Reid and Wu (1999) formulation. Both ver- 
sions of r* and the likelihood root, but not the Wald 

N(0,1) scale 




-3 -2-10 1 2 3 



statistic, let us reject the null hypothesis of a con- 
stant variance function at the 5% level. 

A common problem in nonlinear heteroscedastic 
regression is the estimation of the variance parame- 
ters, whose maximum likelihood estimators are usually 
heavily biased. More accurate estimates can be ob- 
tained from the adjusted profile likelihood using the 
fitting routine mpl: 

> C3.mpl <- mpK CS.nl ) 

> summary ( C3 . mpl ) 

We obtain % = 2.654 (0.834), ka = 1.081 (0.711) 
and logo"a = —1.825 (0.248), when the correction 
term M{ip) is based upon the p* density approxi- 
mation. The Ptem approximation yields 7a = 2.650 
(0.833), ka = 1.092 (0.721) and log^a = -1.827 (0.248) 
The standard errors are obtained from the profile 
information matrix jp{il^); this is possible because 

ba(V^)| = |jp(V^)|{l + Op(n-i)}andV'a-^ = Op(n-i). 
The distance between the values obtained from the 
profile likelihood and the adjusted profile likelihood 
functions gives an idea of the bias of the ordinary 
maximum likelihood estimators. It is reassuring that 
both versions of the adjusted profile likelihood yield 
similar estimates. 

7. DISCUSSION 

The purpose of this paper is to show that highly 
accurate likelihood methods may be routinely used 
in data analysis, both to check whether standard ap- 
proximations are adequate and to supplement them 
when they are inadequate. Libraries are available 
that implement these procedures for a variety of 



original scale 



1400 1600 1800 2000 
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Fig. 6. Herbicide bioassay: data set C3. Approximate bivariate contour plots and profile traces for the parameters (32 and 
K obtained with the contour method of the nlreg package (a — 0.05). The pivots used are as follows: likelihood root ( — ), 
modified likelihood root w* ( — ) and Wald ( ). Right panel: original scale; left panel: r scale. 
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common models. Although available for the numeri- 
cal computing environment R, it would be relatively 
straightforward to modify them for other packages. 
The classes of models discussed in this paper form a 
small subset of those used in practice, but the same 
ideas can be extended to other classes for which 
some unifying structure can be identified, so that 
a common statistical computation setup is possible; 
see, for example, Guolo, Brazzale and Salvan (2006). 
In other general approach requiring a small 

amount of programming is described by 
Brazzale, Davison and Reid (2007), Section 9.5. 

In our work we have focused on so-called dou- 
ble saddlepoint approximations, although sequen- 
tial saddlepoint approximations leading to inference 
based on expressions such as (5) should also be men- 
tioned (Pierce and Peters (1992); DiCiccio and Mar- 
tin (1993)). The maximum likelihood estimator ^/^a 
provides good point estimates, for example, when es- 
timating the variance parameters in a linear or non- 
linear regression model; the bias correction is essen- 
tially what is provided by the use of REML through 
maximization of the restricted likelihood function. 
If an adjusted profile log likelihood derived from (5) 
is available, comparison of it with the correspond- 
ing profile log likelihood £p{ip) gives valuable infor- 
mation about the bias of the maximum likelihood 
estimator, and if the adjusted profile log likelihood, 
logLa('i/'); is close to quadratic, then it should be 
safe to base inference for ^0 on the corresponding 
likelihood root statistic ra(0). However, if the pro- 
file log likelihoods are asymmetric, then inference 
based on r*{i{j) or, if available, r*{ip), will be prefer- 
able. For instance, the asymmetry of both curves in 
the left-hand panel of Figure 1 is a warning to avoid 
using Wald statistics. 

Different expressions are available for the correc- 
tion terms q{ip) and M{ip) in (4) and (5). Though al- 
most equivalent from the analytical and the numer- 
ical points of view, preference for one version or the 
other is not merely a matter of taste. Expression (15) 
in Appendix A.l requires the availability of an exact 
ancillary statistic, and this is rarely the case. Skov- 
gaard's (1996) sample space approximations circum- 
vent this problem, but require the calculation of the 
covariances that are involved. Expression (16) of the 
Appendix is more versatile both in its derivation and 
implementation, but it seems unclear if it applies to 
dependent data. Bellio and Sartori (2006) illustrate 
the versatility of the higher order methods discussed 
in this paper. 



A parallel literature on analytical approximations 
for Bayesian inference based on marginal posterior 
densities leads to remarkably similar expressions. 
Both the conceptual and mathematical developments 
are simpler, the first because arguments involving 
ancillarity are not required in the Bayesian paradigm, 
and the second because Laplace approximation for 
integrals is used (Tierney and Kadane (1986), 
Tierney, Kass and Kadane (1989), DiCiccio, Field 
and Eraser (1990)). Elementary expositions are 
given by Davison (2003), Section 11.3.1 and 
Brazzale, Davison and Reid (2007), Section 8.7. The 
relation with matching and noninformative priors 
(Tibshirani (1989), Reid, Mukerjee and Eraser (2002)) 
yields a close but imperfect rapprochement between 
objective Bayesian and Fisherian approaches. If so, 
the Holy Grail of objective statistical inference sought 
by Jeffreys and Fisher will have been reached — at 
least approximately! A recent example of this rap- 
prochement is given by Davison and Sartori (2008). 

Almost all the literature on higher order likeli- 
hood inference concerns regular models, yet non- 
regular situations are of growing interest. Testing 
for zero variance components plays a role in both 
spline smoothing applications and in generalized lin- 
ear mixed models, for example, and the boundary 
hypotheses this entails lead to modifications of the 
usual limiting distributions. It would be valuable to 
have accurate and practicable analytical approxima- 
tions for the more common nonregular situations. A 
first step in this direction is made by 
Castillo and Lopez- Ratera (2006). 

We have mainly discussed analytical approxima- 
tions, but DiCiccio, Martin and Stern (2001), 
Lee and Young (2005), and DiCiccio and Young 
(2008) have used the parametric bootstrap to achieve 
high accuracy. As yet, the properties of this ap- 
proach are understood only in certain, albeit impor- 
tant, cases, and rather large simulations seem to be 
needed for it to give solid gains over analytical ap- 
proximation, which we believe to be sufficiently ac- 
curate for most applications. Differences in the first 
two decimal places of a P-value may influence de- 
cisions taken in practice, while variation in further 
places is crucial only in exceptional cases. 

In this paper our concern is with implementation 
of accurate likelihood inference, which typically en- 
tails the elimination of parameters from likelihoods, 
by appropriate, often approximate, conditioning or 
marginalization. The different roles of conditioning 
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in inference have been aired at length in the liter- 
ature, and the interested reader may refer to Cox 
(1988) or Reid (1995), for instance, for more general 
discussion on this topic. The discussion of Section 4 
might be misinterpreted as an attack on conditional 
inference, but it is rather intended to point out that 
the properties of statistical procedures labelled 'ex- 
act' merit critical examination. So-called exact in- 
ference may come at too high a price. 

Why seek highly accurate inferences for a model 
that may be incorrect? Although we entirely agree 
that the robustness of conclusions plays a key role in 
applied work, we believe that this question is slightly 
beside the point. Provided the assumed model is 
found empirically acceptable, checking and, if nec- 
essary, improving the usual basis for inference seems 
worthwhile, even if the model might be falsified based 
on a larger sample. A Jesuitical reply might be: 
faced with an apparently Gaussian sample of size 
ten, should one base inference for its mean on a 
normal approximation or on the Student t distribu- 
tion? A reader who would opt for the latter should 
also be willing on occasion to use the approaches 
described above. It would be worrying if the higher 
order methods were very sensitive to model fail- 
ure, but published and unpublished work on this 
(Eraser, Wong and Wu (1999), Bellio (2000)), as well 
as our own limited simulations, suggest that there is 
no dramatic breakdown of them under small model 
perturbations. It would be reassuring to have more 
evidence of this, however. 

We hope that this paper will encourage others to 
use higher order methods in their applied work: a 
recipe may be appetizing in theory, but the proof of 
a pudding is in the eating. 

APPENDIX: HIGHER ORDER 
APPROXIMATIONS 

A.l Modified Likelihood Root 

The key element in using the modified likelihood 
root is the form of (4), whether computed using ip{0) 
or using equivalent expressions. 

Barndorff-Nielsen (1983) gives 



(15) 



i \m\ y^" 

1|jaa(^V')|J ' 



where the data vector y in i{0;y) is expressed as 
a one-to-one function of the maximum likelihood 
estimator 9 and of an ancillary statistic a, whose 
distribution does not depend on 9. Expression (15) 
is obtained by formally setting ^p{9y = £,^{6;6,a). 
The numerator on the right-hand side of (15) is the 
determinant of a (i x d matrix whose first column 
is the difference of sample-space derivatives i.g{9) — 

£.g{9^), defined as £.§(6) = £.g{9; 9, a) = d£{9; 9, a)/ 89, 
and whose remaining columns comprise the dx (d — 
1) matrix of mixed derivatives £^.g{9) = 

d'^£{Tp, X;9,a)/d9 dX^ evaluated at the constrained 
estimate 9^ . The denominator contains the dxd ma- 
trix of mixed derivatives £q.§{9) = d'^£{9; 9, a) /d9 86^ 

evaluated at 9. 

A difficulty in using (15) is the need to differen- 
tiate £{9;y) partially with respect to the maximum 
likelihood estimator 9, while holding fixed the value 
of a full-dimensional ancillary statistic a. 
Eraser, Reid and Wu (1999) bypass this difficulty in 
several ways. Eirst, they note that only a second or- 
der, approximate, ancillary is needed, so the results 
apply to a broader range of models. Second, the an- 
cillary is needed only at the observed data point, 
and this is given by tangents {Vi, . . . , V^) defined in 
the directions corresponding to fixed values of the 
ancillary. Third, any differentiation in these direc- 
tions is allowed, but differentiation with respect to 
9 is irrelevant. Thus, the nominal reparameteriza- 
tion can be given as if{9y = £-v{9;y) using direc- 
tional derivatives and yielding an expression which 
compares directly with (15), 



(16) 



_ \£.vi9)-£.v{§^) £x.v{e^)\ 
^\jxx{9i,)\' 



Here, v(^) is the canonical parameter of an exponen- 
tial family with sufficient statistic s = s{y) = 
8£{6^; y) / 89, the score function evaluated at the max- 
imum likelihood estimate 9^ for a fixed point y^, 
which approximates the true model locally at y^ 
with a relative error of the order 0{n~^^'^) and whose 
log likelihood function and first derivative with re- 
spect to 9 at the fixed point y^ equal those of the 
original model. The canonical parameter 93 is defined 
using a set of n vectors of length d through an n x d 
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matrix V, with rows Vi,. . . , Vn, where i-v{6; y) indi- 
cates that the log hkehhood is differentiated on the 
surface spanned by the columns of V. If the obser- 
vations yi,. . . ,yn are independent, then i-v{6; y) = 
Y^^=i^id^iO;y) / dyi. In the continuous case, the Vi 
can be constructed as 



V - 



dz 
dy^ 



dz 
W 



using a vector of pivotal quantities z = {zi (yi , 6*) , . . . , 
ZniUm G)}, where each component Zi{yi,9) has a fixed 
distribution under the model. In the continuous case 
such a vector always exists in the form of the prob- 
ability integral transformation F{yi;6). 

Skovgaard (1996) develops an approximation to 
qi, which avoids specification of a by approximating 
the sample space derivatives in (15) as 

^.e{e) - i./9^) = mr'j{e)Q{e,e^,), 

using moments of quantities such as the expected 
Fisher information i{6) and the covariances 



S{ei,e2) = coveAMGi),ie{92)}, 
h,92) = coveA^ei0i)Jiei)- 



The covariances 5 and Q are often readily com- 
puted, either analytically or by simulation, though 
the resulting tail area approximation has error 0{n~^) 
rather than 0(n~^/^). In an as-yet unpublished work, 
Fraser and Reid (2009) point out that this version 
may be obtained by replacing ip{6) by 
dE{i{e;y);eo}/d6o, evaluated at 60 = 9. 

A. 2 Adjusted Profile Likelihood 

The correction term M{ip) can be derived using 
the p* approximation, 

M,{i,) = \jxx{a^)\'/V\^x-:x0i')l 

which gives rise to Barndorff-Nielsen (1983) modi- 
fied profile likelihood, or using the tangent exponen- 
tial model, 

M2(V') = |jaa(^v)I'^V|</'a(^^)"Va|, 

where V\ is the n x {d — do) matrix obtained from 
V = V{6) by omitting the columns which relate to 
the parameter of interest (Fraser (2003)). 
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